CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
Similar resources
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency b...
A Survey on Linguistic Structures for Application-level Fault-Tolerance
The structures for the expression of fault-tolerance provisions into the application software are the central topic of this paper. Structuring techniques answer the questions "How to incorporate fault-tolerance in the application layer of a computer program" and "How to manage the fault-tolerance code". As such, they provide means to control complexity, the latter being a relevant factor for the...
Exploiting Application-Level Correctness for Low-Cost Fault Tolerance
Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user’s perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which cor...
Performance Tradeoffs in Policies for Application Level Fault Tolerance
Object oriented applications and services are composed of a number of objects with instances, which interact to accomplish common goals. Fault tolerance is attained via application transparent replication policies for masking faults that do not recur after recovery. Recently, we realized the advent of a number of middleware infrastructures and services, which allow customizing the replication c...
Application-Level Resilience Modeling for HPC Fault Tolerance
Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic due to its random nature. In this paper, we introduce a new meth...
Journal
Journal title: IEEE Transactions on Parallel and Distributed Systems
Year: 2019
ISSN: 1045-9219,1558-2183,2161-9883
DOI: 10.1109/tpds.2018.2866794